Machine learning-based prediction of critical illness in children visiting the emergency department

Objectives
Triage is an essential emergency department (ED) process designed to provide timely management depending on acuity and severity; however, the process may be inconsistent with clinical and hospitalization outcomes. Therefore, studies have attempted to augment this process with machine learning models, showing advantages in predicting critical conditions and hospitalization outcomes. The aim of this study was to utilize nationwide registry data to develop a machine learning-based classification model to predict the clinical course of pediatric ED visits.
Methods
This cross-sectional observational study used data from the National Emergency Department Information System on emergency visits of children under 15 years of age from January 1, 2016, to December 31, 2017. The primary and secondary outcomes were to identify critically ill children and predict hospitalization from triage data, respectively. We developed and tested a random forest model with the under sampled dataset and validated the model using the entire dataset. We compared the model’s performance with that of the conventional triage system.
Results
A total of 2,621,710 children were eligible for the analysis and included 12,951 (0.5%) critical outcomes and 303,808 (11.6%) hospitalizations. After validation, the area under the receiver operating characteristic curve was 0.991 (95% confidence interval [CI] 0.991–0.992) for critical outcomes and 0.943 (95% CI 0.943–0.944) for hospitalization, which were higher than those of the conventional triage system.
Conclusions
The machine learning-based model using structured triage data from a nationwide database can effectively predict critical illness and hospitalizations among children visiting the ED.


Introduction
In the emergency department (ED), triage is the first and most important step, classifying patients according to acuity and severity [1]. However, triage classifications tend to be similar but not always identical to ED or hospitalization outcomes [2,3] because triage systems are designed to provide timely and appropriate treatment in a resource-limited ED environment [4][5][6], not to predict the clinical outcome of the patient. Nevertheless, early identification of patients at risk of deterioration is of considerable clinical interest. Thus, studies have tried to predict critical or hospitalization outcomes in the ED [7][8][9][10][11][12].
The Pediatric Early Warning Score (PEWS) is an example of a scoring system used to detect children who are in need of intensive care unit (ICU) admission [13]. PEWS was originally developed and validated in the inpatient setting [14,15], although some validation in the ED setting has been attempted [16,17]. Another scoring system, designed to predict hospitalization, is the Pediatric Risk of Admission (PRISA) score [11,12]. The PRISA score was developed to predict hospitalization in the pediatric ED, but it is composed of 21 components gathered after the initial evaluation, including therapies, which makes it difficult to apply at the initial presentation to the ED.
Meanwhile, machine learning has begun to augment medical research, and various studies have attempted to introduce prediction models using machine learning. Machine learning models, such as random forest (RF), gradient boosting, and deep neural network methods, are able to handle large datasets effectively and have been shown to predict clinical outcomes more accurately than traditional methods for patients in the ICU and patients with sepsis [18][19][20][21][22]. Additionally, some studies have demonstrated that machine learning models can offer advantages in predicting critical condition and hospitalization outcomes [23][24][25][26][27], even in the pediatric population [28,29].
In this study, we used nationwide data from the National Emergency Department Information System (NEDIS) to develop a machine learning-based classification model to predict the clinical course of pediatric ED visitors. We also compared the performance of the derived machine learning model with that of the conventional pediatric triage system of South Korea (pediatric Korean Triage and Acuity Scale [pedKTAS]). In addition, we sought to define the importance of factors that predict critical cases and hospitalization among the selected predictor variables used in the analysis.

Study design and setting
This is a cross-sectional observational study investigating pediatric patients visiting the ED in South Korea using nationwide registry data. The Korean Triage and Acuity Scale (KTAS) is a 5-level triage system (from level 1, the most critical, to level 5, nonurgent) that was developed based on the Canadian Triage and Acuity Scale (CTAS). This scale has been used since its introduction in 2016 and has shown adequate reliability and validity [30]. The KTAS is divided into adult and pediatric (pedKTAS) versions based on an age cutoff of 15 years. We included emergency visits by children under 15 years of age from January 1, 2016, to December 31, 2017, which was after the pedKTAS was introduced and established in South Korea.
We obtained data from the NEDIS, which is a national database that was developed in 2004 and collects information from more than 400 EDs across South Korea. The NEDIS database contains various types of information, such as patient age, sex, type of insurance, means of transportation, level of consciousness at presentation, time variables (visit, discharge, and admission), and vital signs at presentation. The NEDIS also provides information about ED disposition and final outcomes of each ED visit (information regarding discharge, transfer, and death). All patients arriving at the ED must be enrolled in the system. All patient-related information from ED arrival to discharge from hospital is transferred automatically from each ED to a central server, and inaccurate data are filtered by a data processing system. NEDIS data are available upon formal request and provided by the National Emergency Medical Center (data acquisition number: N20192821211).
The Institutional Review Board of Seoul National University Hospital approved this study (IRB No. E-1909-098-1065) with a waiver of consent. Patients or the public were not involved in the design, conduct, reporting, or dissemination plans of our research.

Outcomes
The primary outcome of this study was the prediction of critically ill children (critical cases) from triage data. Critical cases were defined as 1) children who were admitted to the ICU or transferred for ICU admission, 2) children who received cardiopulmonary resuscitation during their ED stay, and 3) children who died in the ED. The secondary outcome was the identification of children who could not be discharged directly from the ED (hospitalization) from triage data. Hospitalization was defined as including both admission to ICUs and general wards.
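The two outcome definitions can be read as simple labeling rules. The following is a minimal sketch, assuming a hypothetical per-visit record with boolean fields; the field names are illustrative and are not NEDIS variable names.

```python
from dataclasses import dataclass


@dataclass
class Visit:
    # Hypothetical fields for illustration; not actual NEDIS variable names.
    icu_admission: bool = False      # admitted to the ICU from the ED
    transfer_for_icu: bool = False   # transferred for ICU admission
    cpr_in_ed: bool = False          # received CPR during the ED stay
    died_in_ed: bool = False         # died in the ED
    ward_admission: bool = False     # admitted to a general ward


def is_critical(v: Visit) -> bool:
    """Primary outcome: ICU admission or transfer, CPR in the ED, or death in the ED."""
    return v.icu_admission or v.transfer_for_icu or v.cpr_in_ed or v.died_in_ed


def is_hospitalized(v: Visit) -> bool:
    """Secondary outcome: admission to an ICU or a general ward.

    Whether transfers count as hospitalization is not stated in the text,
    so they are not counted here (an assumption of this sketch).
    """
    return v.icu_admission or v.ward_admission
```

Under these rules an ICU admission counts toward both outcomes, whereas a general ward admission counts only toward hospitalization.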

Predictor variables and preprocessing
Demographic information, such as patient age and sex, was collected. Vital signs (blood pressure, heart rate, respiratory rate, body temperature and oxygen saturation) and consciousness level measured on the AVPU scale (alert, verbal response, response to pain, and unresponsive) at triage, transportation method, reason for ED visit (traumatic or nontraumatic), ED visit time and time from onset were also collected. A detailed list of variables used in the development of the model is shown in S1 Table. Data on vital signs were preprocessed for machine learning because the normal values of some vital signs vary depending on age (such as blood pressure, heart rate, and respiratory rate). The Z scores of these age-dependent variables were calculated for each age range for adjustment before the final analysis. Categorical variables with low cardinality (sex and level of consciousness) were one-hot encoded. Missing values for continuous variables were imputed as the means of the nonmissing values of each corresponding variable, and missing values for categorical variables were coded as "Not Available", representing an additional category (using one-hot encoding).
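The preprocessing steps described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the study's code: the age bands, the field names (`age`, `hr`, `avpu`), and the toy records are hypothetical, and the text does not specify the age ranges that were actually used.

```python
from collections import defaultdict
from statistics import mean, pstdev


def age_band(age_years):
    """Hypothetical age bands; the actual cut-offs are not given in the text."""
    if age_years < 1:
        return "<1y"
    if age_years < 5:
        return "1-4y"
    if age_years < 10:
        return "5-9y"
    return "10-14y"


def zscore_by_age(rows, key):
    """Add `<key>_z`: the Z score of `key` within each age band (None stays None)."""
    by_band = defaultdict(list)
    for r in rows:
        if r[key] is not None:
            by_band[age_band(r["age"])].append(r[key])
    stats = {band: (mean(v), pstdev(v)) for band, v in by_band.items()}
    for r in rows:
        if r[key] is None:
            r[key + "_z"] = None
        else:
            mu, sd = stats[age_band(r["age"])]
            r[key + "_z"] = (r[key] - mu) / sd if sd > 0 else 0.0
    return rows


def impute_mean(rows, key):
    """Replace missing continuous values with the mean of the observed values."""
    mu = mean(r[key] for r in rows if r[key] is not None)
    for r in rows:
        if r[key] is None:
            r[key] = mu
    return rows


def one_hot(rows, key, categories):
    """One-hot encode `key`; a missing value becomes the 'Not Available' category."""
    for r in rows:
        value = r[key] if r[key] is not None else "Not Available"
        for c in list(categories) + ["Not Available"]:
            r[f"{key}={c}"] = 1 if value == c else 0
    return rows


# Toy records (not NEDIS data): heart rate `hr` and AVPU level `avpu`.
rows = [
    {"age": 0.5, "hr": 140, "avpu": "A"},
    {"age": 0.5, "hr": 160, "avpu": None},
    {"age": 7, "hr": 90, "avpu": "V"},
    {"age": 7, "hr": None, "avpu": "A"},
    {"age": 7, "hr": 110, "avpu": "A"},
]
zscore_by_age(rows, "hr")
impute_mean(rows, "hr_z")
one_hot(rows, "avpu", ["A", "V", "P", "U"])
```

Standardizing within age bands puts a heart rate of 140 in an infant and a heart rate of 90 in a school-age child on the same scale, which is the purpose of the age adjustment.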

Training of machine learning classifiers
In this study, we used RF to identify critically ill children and predict hospitalization. Machine learning (ML) algorithms are often not interpretable, the so-called "black box" phenomenon. However, interpretable models have shown performance technically equivalent to that of "black box" models [31,32]. The RF algorithm can calculate the importance of each variable used in the model from the reduction in the Gini index, which mitigates the "black box" problem and makes the model more interpretable to some extent, allowing us to identify important variables. Therefore, we selected RF as the ML algorithm for the predictive model in this study.
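To make the Gini-based importance concrete, the sketch below computes the impurity decrease of a single split. It illustrates the quantity only; the function names are hypothetical. A forest's importance for a variable aggregates such decreases over every node that splits on that variable, across all trees.

```python
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(parent, left, right):
    """Impurity decrease of one split, weighting each child by its share of samples."""
    n = len(parent)
    weighted_children = (len(left) / n) * gini(left) + (len(right) / n) * gini(right)
    return gini(parent) - weighted_children


parent = [1, 1, 1, 1, 0, 0, 0, 0]
perfect_split = gini_decrease(parent, [1, 1, 1, 1], [0, 0, 0, 0])
useless_split = gini_decrease(parent, [1, 1, 0, 0], [1, 1, 0, 0])
```

A split that separates the classes completely removes all of the parent's impurity, whereas a split that leaves both children with the same class mix removes none, so variables that produce cleaner splits accumulate higher importance.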
Due to the imbalance in the entire dataset, the eligible study population was under sampled at a ratio of 1:1 for both critical cases and hospitalization cases using the Python package 'imbalanced-learn' [33]. Each under sampled dataset was subjected to model derivation and testing through a 5-fold cross-validation process. The "RandomForestClassifier" function of Python's Scikit-Learn library was used for RF model development and testing. The default values of this function were used for all hyperparameters except the number of trees ("n_estimators"). For the number of trees, a value between 10 and 1000 that showed excellent performance was used [34]. In addition, we also compared the performance of our models with that of pedKTAS, which served as the reference model.
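The derivation workflow can be sketched as follows. This is a minimal illustration on synthetic data rather than the study's pipeline. The hand-written `undersample_1to1` helper is a hypothetical stand-in for the random under sampling that 'imbalanced-learn' provides. The choice of 100 trees, the random seeds, and the synthetic features are all assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score


def undersample_1to1(X, y, seed=0):
    """Randomly drop majority-class rows until both classes have equal size."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y == 1)
    neg = np.flatnonzero(y == 0)
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    kept = rng.choice(majority, size=len(minority), replace=False)
    idx = np.concatenate([minority, kept])
    rng.shuffle(idx)
    return X[idx], y[idx]


# Synthetic stand-in data (NOT NEDIS): 5 features, roughly 5% positives.
rng = np.random.default_rng(42)
X = rng.normal(size=(4000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=4000) > 2.0).astype(int)

X_bal, y_bal = undersample_1to1(X, y)

# Defaults everywhere except the number of trees; 100 is an arbitrary value
# inside the 10-1000 range mentioned in the text.
model = RandomForestClassifier(n_estimators=100, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auroc = cross_val_score(model, X_bal, y_bal, cv=cv, scoring="roc_auc")

# Refit on the balanced data, then score every row of the original data.
# Note that the full data include the rows used for fitting.
model.fit(X_bal, y_bal)
risk_all = model.predict_proba(X)[:, 1]
```

Under sampling keeps all minority-class rows and discards most majority-class rows, so the cross-validated scores describe performance at a 1:1 class ratio rather than at the original prevalence.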

Data analysis
All data handling, statistical analysis and machine learning were performed with R platform version 3.6.3 (R Foundation for Statistical Computing, Vienna, Austria). Continuous variables were reported using medians and interquartile ranges (IQRs), and categorical variables were reported using frequencies and proportions.
The performance of the RF models was assessed by calculating the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC) with 95% confidence intervals (CIs). In addition, the importance of each variable was calculated from the decrease in the Gini index using Scikit-Learn's "feature importance" function [34].
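For readers unfamiliar with the two metrics, the sketch below computes them from first principles on a toy example. It is an illustration only: the function names are hypothetical, the pairwise AUROC is quadratic in the sample size, and average precision is just one common estimator of the AUPRC (the text does not state which estimator was used).

```python
def auroc(y_true, scores):
    """AUROC as the probability that a random positive outranks a random negative.

    Ties count as one half (Mann-Whitney U formulation).
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (average precision).

    Assumes untied scores for simplicity.
    """
    ranked = sorted(zip(scores, y_true), key=lambda t: -t[0])
    n_pos = sum(y_true)
    true_pos, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_pos += 1
            total += true_pos / rank  # precision at each recalled positive
    return total / n_pos


y_true = [1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.1]
toy_auroc = auroc(y_true, scores)
toy_ap = average_precision(y_true, scores)
```

The AUROC depends only on how positives rank relative to negatives, whereas precision also depends on how many negatives there are, which is why the two metrics can diverge on imbalanced data.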

Results
A total of 2,621,710 ED visits were made by children younger than 15 years old and classified by the pedKTAS during the study period. As described above, the eligible cases were under sampled at a ratio of 1:1 to overcome imbalance for both the critical case group and hospitalization group. The basic demographic and clinical characteristics of the cases included in the analysis are summarized in Table 1. Overall, 12,951 (0.5%) patients had critical clinical outcomes (critical cases), and 303,808 (11.6%) patients were hospitalized (hospitalization). For the total population, the median age was 3.0 years old (IQR 1.0-7.0), and 57.2% of the children were male. Among the eligible patients, 22,359 (0.9%) had an unknown disease/injury category, and 71,976 (2.7%) had an unknown initial mental status.
Fig 1 presents the classification results of the RF models and the comparison with pedKTAS for both outcomes. For the prediction of critical cases, the AUROC was 0.973 (95% CI 0.971-0.977) in the under sampled dataset (Fig 1A) and 0.991 (95% CI 0.991-0.992) in the validation dataset (Fig 1B). For the prediction of hospitalization, the AUROC was 0.819 in the under sampled dataset (Fig 1C) and 0.943 (95% CI 0.943-0.944) in the validation dataset (Fig 1D). For validation with the entire dataset, we also compared the prediction performance with that of pedKTAS.
Additionally, the AUPRCs of the RF models for the under sampled dataset and for validation with the entire dataset, including the comparison with pedKTAS, are shown in Fig 2. For critical cases, the AUPRC was 0.977 (95% CI 0.974-0.979) in the under sampled dataset and 0.640 (95% CI 0.633-0.648) in the validation dataset (Fig 2B). For hospitalization, the AUPRC was 0.819 (95% CI 0.817-0.821) in the under sampled dataset and 0.729 (95% CI 0.728-0.730) in the validation dataset. In validation, the performance was compared with that of the conventional triage system (pedKTAS).
Fig 3 graphically displays the predictor variable importance. For critical cases, age was the most important variable, followed by respiratory rate, heart rate, arrival by other vehicles, and body temperature. For hospitalization, age was also the most important variable, with body temperature being a close second. The other important variables for hospitalization were time from onset to ED visit, heart rate, and respiratory rate.

Discussion
In this study, using national data from 2,621,710 ED visits by children, we developed RF models to predict critical cases and hospitalization upon arrival at the ED and compared them with the conventional triage system.

PLOS ONE
The RF model showed good performance in discriminating critical cases, with an AUROC of 0.991 and an AUPRC of 0.640, from the limited information available at the initial presentation. To our knowledge, this is the first attempt to use an ML algorithm for predicting outcomes in pediatric ED visitors using nationwide data. The RF model achieved higher performance than the conventional clinical prediction rules in predicting both critical cases and hospitalization among pediatric ED visitors, with AUROC values of 0.991 versus 0.844 and 0.943 versus 0.680, respectively. This performance is also better than that of pre-existing scoring systems for clinical prediction [10,17,35]. These conventional methods consist of fewer variables and use a linear model with few interactions, whereas ML can perform high-order calculations. In a previous study by Goto et al. [28] that predicted pediatric outcomes in ED triage based on an ML model, the RF model had an AUROC of 0.85 (95% CI 0.79-0.91) for critical cases and 0.80 (95% CI 0.78-0.81) for hospital admissions. The improvement in the AUROC in our study may be due to the greater number of predictor variables used in our analysis. Additionally, there were slight differences in the choice of variables. In the abovementioned study, the important predictors for critical care included age, vital signs, and arrival mode. Our study showed a similar pattern of variable importance, except for 'level of consciousness' and 'time from onset to ED visit', which were of high importance in our study but were not included in the analysis of the previous study.
In another study, a gradient boosting model was used to predict mortality in an adult population with AUROC values ranging from 0.949 to 0.960 [36]. Given that this study was conducted in a single institute, it was possible to obtain more detailed variables, such as 'unstructured chief complaint' or 'number of days to previous ED visit'. Our study used only highly quality-controlled and structured data, which did not completely utilize the various abilities of ML. Integration of unstructured data, such as text data, into the algorithm may present new possibilities. In addition to the abovementioned study, Choi et al. [37] showed that the addition of text data improves the predictive performance of ML triage compared to that of a model using only structured data. Lucini et al. [38] predicted the need for hospitalization based on written records of the first medical assessment in the ED using text-mining approaches.
There are some limitations of our study. First, although our model showed high AUROC values of 0.991 (for critical cases) and 0.943 (for hospitalization), the AUPRC on the entire dataset was lower (0.640 for critical cases and 0.729 for hospitalization), which was probably due to the imbalanced dataset [39]. Critical cases accounted for only 0.5% (n = 12,951) of the total population, and we tried to overcome this imbalance using the under sampling method. With the under sampled training data, the AUPRC was higher (0.977 for critical cases and 0.819 for hospitalization) than in the validation with the entire dataset. In predicting hospitalization, the RF model showed a lower AUROC than in predicting critical cases. However, its AUPRC on the entire dataset was higher than that for critical cases, which was probably due to the larger number of hospitalized children (n = 303,808).
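The dependence of precision on prevalence follows directly from Bayes' rule and can be checked with a short calculation. In the sketch below, the prevalences come from the counts reported in this study, whereas the operating point (sensitivity 0.95, specificity 0.95) is hypothetical and is not a result of the study.

```python
def precision_at(sensitivity, specificity, prevalence):
    """Positive predictive value implied by Bayes' rule at a given prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


TOTAL = 2_621_710
prevalences = {
    "balanced 1:1 sample": 0.5,
    "hospitalization, full data": 303_808 / TOTAL,
    "critical cases, full data": 12_951 / TOTAL,
}

for name, prev in prevalences.items():
    # 0.95 / 0.95 is a hypothetical operating point, not a result from the study.
    print(f"{name}: precision = {precision_at(0.95, 0.95, prev):.3f}")
```

At a fixed sensitivity and specificity, precision falls as the outcome becomes rarer, which is consistent with the AUPRC being lower on the entire dataset than on the 1:1 sample and lower for critical cases than for hospitalization.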
Second, although we used nationwide data, some bias is possible. As mentioned in the methods, some variables had missing values that we had to impute or classify as 'unknown'. Additionally, input errors from each hospital could occur. However, the NEDIS dataset is quality-controlled by the National Emergency Medical Center of Korea and regularly undergoes a quality assessment process [40,41].
Finally, although this study used a large dataset from a nationwide registry, further studies in other countries and/or prospective validation must be performed. However, for the prospective validation of our ML model, the development of an EMR-embedded program with automatic calculation will be appropriate and must precede the experiment.

Conclusions
ML models using structured triage data from a nationwide database can more effectively predict critical cases and hospitalizations among pediatric ED visitors than the conventional triage method. Age was the most important predictor for both ED outcomes, but the importance of the other predictors differed between critical cases and hospitalization. Although prospective validation and integration of unstructured data are needed, the results of this study can support advances in pediatric triage and resource distribution in pediatric EDs.

Fig 3. Top thirty predictors with the highest importance for each outcome. The importance of each feature was calculated through information gain using the difference in Gini impurity reduction. The "feature importance" function of Python's scikit-learn library was used [34]. A. Critical cases, and B. hospitalization. https://doi.org/10.1371/journal.pone.0264184.g003

Supporting information
S1